Find all valid adjustment sets if we want to estimate the effect of Exercise on Health
BONUS: What happens if we control for Motivation? Why?
exercise: simulate simple confounding
Copy the code below.
Run the base simulation and observe results
Modify the simulation parameters:
Change the strength of the confounding (modify the 0.5, 0.8, and 0.6 coefficients)
Change the sample size (N)
Add a true causal effect (modify Y calculation to include X)
Answer these questions:
What happens to the bias in the naive estimate as you increase the strength of confounding?
How does sample size affect the precision of your estimates?
When does controlling for Z fail to recover the true causal effect?
Code
```r
library(brms)
library(tidyverse)

set.seed(9)

# Number of observations
N <- 1000

# Generate data
U <- rnorm(N)                  # Unobserved confounder
X <- rnorm(N, mean = 0.5 * U)  # Treatment affected by U
Y <- rnorm(N, mean = 0.8 * U)  # Outcome affected by U
Z <- rnorm(N, mean = 0.6 * U)  # Observed variable that captures U
d <- data.frame(X, Y, Z)

# Naive model: Y ~ X
m1 <- brm(data = d, family = gaussian,
          Y ~ 1 + X,
          prior = c(prior(normal(0, 0.50), class = Intercept),
                    prior(normal(0, 0.25), class = b),
                    prior(exponential(1), class = sigma)),
          iter = 2000, warmup = 1000, seed = 3, chains = 1)
posterior_summary(m1)

# Adjusted model: Y ~ X + Z
m2 <- brm(data = d, family = gaussian,
          Y ~ 1 + X + Z,
          prior = c(prior(normal(0, 0.50), class = Intercept),
                    prior(normal(0, 0.25), class = b),
                    prior(exponential(1), class = sigma)),
          iter = 2000, warmup = 1000, seed = 3, chains = 1)
posterior_summary(m2)

# Compare the posterior distributions of the X coefficient
post.1 <- as_draws_df(m1)
post.2 <- as_draws_df(m2)
results_df <- data.frame(naive = post.1$b_X, adjusted = post.2$b_X)

results_df %>%
  pivot_longer(everything()) %>%
  ggplot(aes(x = value, fill = name)) +
  geom_density(alpha = .5) +
  geom_vline(aes(xintercept = 0), linetype = "dashed")
```
bad controls
“Bad controls” can create bias in three main ways:
Collider bias (as we saw in the previous exercise)
Precision parasites (reduce precision without addressing confounding)
Bias amplification (making existing bias worse)
Warning signs of bad controls:
Post-treatment variables
Variables affected by both treatment and outcome
Variables that don’t address actual confounding paths
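These checks can be automated by writing the DAG down and querying it. A minimal sketch using the dagitty package (the DAG edges here are hypothetical, chosen so that each kind of bad control appears once):

```r
library(dagitty)  # assumes the dagitty package is installed

# Hypothetical DAG: U confounds X and Y; C is a collider (child of both
# treatment and outcome); M is a post-treatment mediator on the X -> Y path
g <- dagitty("dag {
  U -> X
  U -> Y
  X -> Y
  X -> C
  Y -> C
  X -> M
  M -> Y
}")

# Valid backdoor adjustment sets contain U; C and M never appear in any of them
adjustmentSets(g, exposure = "X", outcome = "Y")
```

Anything `adjustmentSets()` never returns (here C and M) is a candidate bad control.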
exercise
Use this code to simulate new variables:
```r
n <- 100

# Z affects X but is not a confounder (there is no Z -> Y path)
Z <- rnorm(n)
X <- rnorm(n, mean = Z)
Y <- rnorm(n, mean = X)  # True effect of X on Y is 1
```
Using different sample sizes (n = 50, 100, 1000), test two models exploring the relationship between X (exposure) and Y (outcome). For each sample size, compare:
Standard errors without controlling for Z
Standard errors when controlling for Z
How does sample size affect the impact of the precision parasite (Z)? Under what conditions is the precision loss most severe?
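One way to run the comparison is sketched below, using `lm()` rather than `brm()` purely for speed (the helper name `se_for` is ours, not part of the exercise). In this DAG, adjusting for Z leaves the estimate unbiased but soaks up the variation in X that identified the slope, so the standard error inflates:

```r
# Compare the SE of the X coefficient with and without the precision parasite Z
se_for <- function(n) {
  Z <- rnorm(n)
  X <- rnorm(n, mean = Z)  # Z causes X only
  Y <- rnorm(n, mean = X)  # true effect of X on Y is 1
  c(naive    = summary(lm(Y ~ X))$coefficients["X", "Std. Error"],
    adjusted = summary(lm(Y ~ X + Z))$coefficients["X", "Std. Error"])
}

set.seed(5)
out <- sapply(c(`50` = 50, `100` = 100, `1000` = 1000), se_for)
out  # adjusted SEs run larger at every n, here by roughly a factor of sqrt(2)
```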
exercise
Use this code to simulate new variables:
```r
n <- 100
conf_strength <- 1

# U is an unmeasured confounder; Z affects X but not Y
U <- rnorm(n)
Z <- rnorm(n)
X <- rnorm(n, mean = Z + conf_strength * U)
Y <- rnorm(n, mean = conf_strength * U)  # No true effect of X on Y
Using different confounder strengths (0.5, 1, 2), test two models exploring the relationship between X (exposure) and Y (outcome). For each confounder strength, compare:
Standard errors without controlling for Z
Standard errors when controlling for Z
Questions:
What happens to the bias when you control for Z?
How does the strength of the confounding affect the amount of bias amplification?
Can you explain why this happens using the DAG?
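To see the amplification directly, here is a quick check with plain `lm()` in place of `brm()` (for speed). With `conf_strength = 1`, the population slopes follow from the variances: naive slope = Cov(X, Y)/Var(X) = 1/3, while conditioning on Z removes part of Var(X) but none of the confounding, giving 1/2:

```r
set.seed(7)
n <- 1e4
conf_strength <- 1

U <- rnorm(n)  # unmeasured confounder
Z <- rnorm(n)  # cause of X only
X <- rnorm(n, mean = Z + conf_strength * U)
Y <- rnorm(n, mean = conf_strength * U)  # no true effect of X on Y

b_naive <- coef(lm(Y ~ X))["X"]      # approx 1/3: biased away from 0 by U
b_adj   <- coef(lm(Y ~ X + Z))["X"]  # approx 1/2: controlling Z amplifies the bias
c(naive = unname(b_naive), adjusted = unname(b_adj))
```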
exercise
Use the simulation code provided in the last two exercises to create a new scenario with both a precision parasite variable (Z1) and a bias amplification variable (Z2).
Questions:
What happens to our estimates when we control for both variables?
Is it better to:
Control for neither
Control for just one (which one?)
Control for both
How can we use DAGs to decide which controls to include?
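One possible setup, reusing the structure of the two previous simulations (the specific coefficients and the choice to give X a true effect of 1 are ours): Z1 and Z2 are both causes of X only, and U confounds X and Y, so conditioning on either Z removes identifying variation in X while leaving the backdoor through U wide open.

```r
set.seed(3)
n <- 1000

U  <- rnorm(n)                # unmeasured confounder of X and Y
Z1 <- rnorm(n)                # precision parasite: cause of X only
Z2 <- rnorm(n)                # bias amplifier: cause of X only
X  <- rnorm(n, mean = Z1 + Z2 + U)
Y  <- rnorm(n, mean = X + U)  # true effect of X on Y is 1

tab <- round(rbind(
  neither = coef(summary(lm(Y ~ X)))["X", 1:2],
  Z1_only = coef(summary(lm(Y ~ X + Z1)))["X", 1:2],
  Z2_only = coef(summary(lm(Y ~ X + Z2)))["X", 1:2],
  both    = coef(summary(lm(Y ~ X + Z1 + Z2)))["X", 1:2]
), 3)
tab  # each added control pulls the estimate further from 1 and inflates the SE
```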
table 2 fallacy
The table 2 fallacy, first described by Westreich and Greenland in 2013, refers to a common misinterpretation in epidemiology and statistics when researchers present multiple adjusted effect estimates in a single table (often “Table 2” in academic papers).
The fallacy occurs when researchers interpret all coefficients in a multiple regression model as total effects, when in fact some are direct effects conditional on the other variables in the model. This can lead to incorrect causal interpretations, particularly when some variables are mediators (lying on the causal pathway between exposure and outcome).
For example, imagine studying how education affects income, with job type as a mediator:
Education → Job Type → Income
Education also directly affects Income
If you include both education and job type in the same regression model, the coefficient for education represents only its direct effect on income (not mediated through job type), not its total effect. However, researchers often mistakenly interpret it as the total effect.
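The education example can be simulated directly; the path coefficients below (0.7, 0.5, 0.6) are arbitrary choices for illustration. The total effect is the direct path plus the mediated path, 0.5 + 0.7 × 0.6 = 0.92, and only the model without the mediator recovers it:

```r
set.seed(11)
n <- 1e4

education <- rnorm(n)
job_type  <- rnorm(n, mean = 0.7 * education)                   # mediator
income    <- rnorm(n, mean = 0.5 * education + 0.6 * job_type)

b_total  <- coef(lm(income ~ education))["education"]
b_direct <- coef(lm(income ~ education + job_type))["education"]
b_total   # close to 0.92: the total effect
b_direct  # close to 0.50: the direct effect only -- the "Table 2" coefficient
```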
This fallacy becomes particularly problematic when:
The research question involves understanding total causal effects
There are multiple pathways between variables
Some variables act as both confounders and mediators
To avoid this fallacy, researchers should:
Clearly specify which effects (total vs direct) they’re interested in
Use appropriate methods like path analysis or mediation analysis when studying causal relationships
Be precise in their interpretation of regression coefficients
Consider creating separate models for different research questions